Bioinformatics An Introduction 4th Edition (Jeremy Ramsden)

11 Randomness and Complexity

incompressible; that is, it cannot be recreated by any means shorter than the process

actually used to generate it in the first place.
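
As a rough illustration of incompressibility, one can compare how well a general-purpose compressor shortens a periodic sequence and an irregular one. The following is only a sketch: the function name `compressed_size`, the test sequences and the choice of `zlib` are assumptions made for this example, and a practical compressor gives merely an upper bound on the length of the shortest description. Note also that the "irregular" sequence here comes from a seeded pseudorandom generator and is therefore, strictly speaking, highly compressible by a program that knows the generator and the seed.

```python
import random
import zlib


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed ASCII encoding of seq.

    This is a crude, compressor-dependent upper bound on description
    length, not a measure of algorithmic complexity itself.
    """
    return len(zlib.compress(seq.encode("ascii"), 9))


# Hypothetical test sequences, both 1000 nucleotides long.
rng = random.Random(42)  # fixed seed so that the example is reproducible
periodic = "ACGT" * 250
irregular = "".join(rng.choice("ACGT") for _ in range(1000))

print("periodic :", compressed_size(periodic), "bytes")
print("irregular:", compressed_size(irregular), "bytes")
```

The periodic sequence shrinks to a small fraction of its length, whereas the irregular one cannot be reduced much below about two bits per nucleotide.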

We have previously stated that bioinformatics could be considered to be the study

of the departures from randomness of DNA. We are shown a sequence of DNA:

Is it random? We want to be able to quantify its departure from randomness.

Presumably those sequences belonging to viable organisms, or even to their individual

proteins or promoter sequences, are not random. What about introns and intergenic

sequences? If they are indeed “junk”, as is sometimes (facetiously?) asserted, then

we might well expect them to be random. Even if they started their existence as

nonrandom sequences, they may have been randomized since they would be subject

to virtually no selection pressure. Mutations are supposed to be random and occur at

random places. The opposite scenario would be that all DNA sequences started as

random ones and then natural selection eliminated many according to some

systematic criterion; therefore, the extant collection of the DNA of viable organisms on this

planet is not random. Can we, then, say anything about the randomness or otherwise

of an individual sequence taken in isolation?

Similar considerations apply to proteins. Given a collection of amino acid

sequences of proteins (which, to be meaningful, should come from the same genome),

we can assess the likelihood that they arose by chance and the degree of their

departures from randomness.

All such sequences can be idealized as sequences of Bernoulli trials (see

Sect. 9.2.3), which are themselves abstractions of a coin tossing experiment. Since

order does not matter in determining the probability of a given overall outcome, 50

heads followed by 50 tails has the same probability of occurring as 50 alternations

of heads and tails, which again is no less probable than a particular realization in

which the heads and tails are “randomly” mixed.
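
The equality of these probabilities can be checked directly. The sketch below assumes independent tosses with a fixed probability of heads; the function name `sequence_probability` and the three example sequences are introduced here purely for illustration.

```python
import random
from fractions import Fraction


def sequence_probability(seq: str, p_heads: Fraction = Fraction(1, 2)) -> Fraction:
    """Exact probability of one particular ordered sequence of independent tosses."""
    prob = Fraction(1)
    for toss in seq:
        if toss == "H":
            prob *= p_heads
        elif toss == "T":
            prob *= 1 - p_heads
        else:
            raise ValueError(f"unexpected symbol: {toss!r}")
    return prob


blocks = "H" * 50 + "T" * 50   # 50 heads followed by 50 tails
alternating = "HT" * 50        # 50 alternations of heads and tails

# One arbitrary "mixed" arrangement of the same 50 heads and 50 tails.
rng = random.Random(0)
shuffled = list(blocks)
rng.shuffle(shuffled)
mixed = "".join(shuffled)

for name, seq in [("blocks", blocks), ("alternating", alternating), ("mixed", mixed)]:
    print(name, sequence_probability(seq) == Fraction(1, 2**100))
```

Each of the three sequences has probability exactly 2^-100. What distinguishes the "random-looking" arrangement is not its probability but the fact that there are vastly more arrangements of that kind than there are regular ones.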

Any nonbinary sequence can, of course, be encoded in binary form. Typical

procedures for biological sequences (amino acids or nucleotides) are to consider

nucleotides as purines (0) or pyrimidines (1), or amino acids as hydrophobic (apolar)

or hydrophilic (polar) residues (cf. Markov’s encoding of poetry as a sequence of

vowels and consonants). Alternatively, the nucleotides could constitute a sequence

in base 4 (A ≡ 0, C ≡ 1, T ≡ 2, G ≡ 3), which can then be converted to base 2.
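
Both nucleotide encodings can be sketched in a few lines. The function names below and the test sequence GATTACA are assumptions of this example; the base 4 digits follow the assignment given above, and each digit is written as two bits so that leading zeros are preserved.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASE4_DIGIT = {"A": 0, "C": 1, "T": 2, "G": 3}


def purine_pyrimidine_bits(dna: str) -> str:
    """Encode purines as 0 and pyrimidines as 1 (one bit per nucleotide)."""
    bits = []
    for base in dna.upper():
        if base in PURINES:
            bits.append("0")
        elif base in PYRIMIDINES:
            bits.append("1")
        else:
            raise ValueError(f"not a nucleotide: {base!r}")
    return "".join(bits)


def base4_bits(dna: str) -> str:
    """Read the sequence as base 4 digits and write each digit as two bits."""
    bits = []
    for base in dna.upper():
        if base not in BASE4_DIGIT:
            raise ValueError(f"not a nucleotide: {base!r}")
        bits.append(format(BASE4_DIGIT[base], "02b"))
    return "".join(bits)


print(purine_pyrimidine_bits("GATTACA"))  # 0011010
print(base4_bits("GATTACA"))              # 11001010000100
```

The one-bit encoding discards information (A and G become indistinguishable), whereas the two-bit encoding is reversible.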

It is a commonly held belief that after a long sequence of heads (say), the opposite

result (tails) becomes more probable. There is no empirical support for this assertion

in the case of coin tossing. In other situations in which the outcome depends on

selecting elements from a finite reservoir, however, clearly this result must hold.

Thus, if a piece of DNA is being assembled from a soup of base monomers at

initially equal concentrations, and by chance the sequence starts out by being poor in

A, say, then later on this must be compensated by enrichment in A (provided that chain

elongation ends only when all available nucleotides have been consumed).
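
The finite-reservoir effect can be made concrete with a small calculation. The sketch below assumes, purely for illustration, a pool of 25 monomers of each base and that every monomer remaining in the pool is equally likely to be incorporated next; the function name `next_base_probability` is likewise hypothetical.

```python
from collections import Counter
from fractions import Fraction


def next_base_probability(pool: dict, prefix: str, base: str) -> Fraction:
    """Probability that the next monomer drawn is `base`, given the prefix already built.

    Assumes sampling without replacement, with every remaining monomer
    equally likely to be chosen.
    """
    remaining = Counter(pool)
    remaining.subtract(Counter(prefix))
    if any(count < 0 for count in remaining.values()):
        raise ValueError("prefix uses more monomers than the pool contains")
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("pool is exhausted")
    return Fraction(remaining[base], total)


pool = {"A": 25, "C": 25, "G": 25, "T": 25}  # initially equal amounts

print(next_base_probability(pool, "", "A"))            # 1/4
print(next_base_probability(pool, "CGTCGTCGTC", "A"))  # 5/18
```

After ten incorporations containing no A, the probability that the next base is A has risen from 1/4 to 5/18. With an effectively infinite reservoir, as in coin tossing, it would remain at 1/4.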

Formal Notions of Randomness

In order to proceed further, we need to understand more carefully what we mean by

randomness. Despite the fact that the man in

the street supposes that he has a good idea of what it means, randomness is a rather

delicate concept. The toss of an unbiased coin is said to be random; the probability